1. INTRODUCTION

Cancer is a leading cause of mortality worldwide: an estimated 10.0 million people died from it in
2020 [1], which amounts to approximately one in every six deaths. Cancer is a general term for a group
of diseases that can affect any part of the body. Its defining trait is the rapid creation of abnormal
cells that grow beyond their natural boundaries and can subsequently invade neighboring parts of the
body and spread to other organs. This process is known as metastasizing, and metastases are a frequent
cause of cancer deaths [2]. A tumor may be classified as malignant or benign; a malignant tumor is a
cancer that spreads through the lymph system and blood to other organs and tissues. Lung, colorectal,
prostate, skin, breast, and stomach cancers are the most prevalent cancers in humans. According to the
World Health Organization's (WHO's) International Agency for Research on Cancer (IARC), 2.3 million
cases of breast cancer disease (BCD) were reported in 2020 [1]. Breast cancer is not a female-only
disease; men are also susceptible to it. However, statistics show that females have a higher risk of
breast cancer than males.

Female breast cancer has been named the sixth leading cause of death in women (6.6%, 627,000
deaths) [3]. The chance of developing breast cancer increases markedly with age. Early diagnosis of
cancer saves lives and reduces treatment costs, in keeping with the saying "prevention is better than
cure." As a result, many studies worldwide are working to combat the disease by developing detection
and prediction technologies that support effective therapy. Machine learning (ML) approaches are
highly important for disease prediction in this regard. For labelled data, the most suitable method is
classification, a supervised ML method [4], [5]. Classification approaches can therefore be used to
predict the disease from patients' test results. Various attempts have been made to benchmark the
precision of classifiers on a variety of disease datasets, but more analysis of classifier performance
on the BCD dataset is still needed. BCD prediction is a two-class problem, with malignant and benign
classes. Apart from the class label, the BCD dataset is quantitative and consists of continuous values.
This study aims to identify the best classifier among 12 candidates (logistic regression [6], support
vector machines (SVM) [7], k-nearest neighbor (k-NN) [8], random forest [9], multi-layer perceptron
(MLP) [10], Gaussian Naive Bayes (NB) [11], decision tree [12], MLP regression [13], perceptron [14],
linear regression [15], extreme gradient boosting (XGBoost) [16], and gradient boosting [17]) for
producing the most accurate predictions on the Breast Cancer Wisconsin (diagnostic) dataset. Four
metrics were used to achieve a robust evaluation: accuracy, recall, precision, and F1-score [18], [19].

Many studies have concentrated on breast cancer because it is regarded as one of the most
dangerous diseases and has spread rapidly worldwide. Their outcomes vary and have improved over time.
A few of these studies are reviewed below, focusing on the datasets and ML approaches they employed
and on the accuracy they reported.

Alghunaim and Al-Baity [20] used three classification techniques, decision tree (DT), SVM, and
random forest (RF), to develop nine models that aid in breast cancer prediction. They considered three
scenarios, using gene expression (GE) data, DNA methylation (DM) data, and GE and DM combined, to see
which of the three forms of data yields the best result with regard to error rate and accuracy.
According to the testing results, the scaled SVM classifier in the Spark environment outperformed the
other classifiers in terms of error rate and accuracy on the GE dataset. Seven supervised ML approaches
were evaluated in [21] with regard to precision, accuracy, recall/sensitivity/true positive (TP) rate,
specificity, negative predictive value, false-positive rate (FPR), F1-score, misclassification rate,
and receiver operating characteristic (ROC) curve, in order to find the best model for BCD prediction.
The results show that KNN was the best performer on the BCD dataset, with 97% accuracy. Although NB
performed similarly to KNN, its precision was lower. With a classification accuracy of 94%, the DT
classifier was the worst performer. RF, SVM, logistic regression, and ANN all performed between NB
and DT.

Zhang et al. [22] describe an unsupervised feature learning framework that identifies various
traits from gene expression profiles by combining principal component analysis (PCA) with an
autoencoder neural network (NN). An ensemble classifier based on the AdaBoost algorithm (PCA-AE-Ada)
was then built on the extracted features. For comparison, they created an additional classifier,
PCA-Ada, which uses the same classifier learning technique as the proposed approach and differs only
in its training inputs. On several independent breast cancer datasets, the area under the receiver
operating characteristic curve, the Matthews correlation coefficient, the accuracy, and other
evaluation measures of the proposed approach were compared against representative gene
signature-based algorithms, including the baseline technique. The experiments show that the proposed
technique, which employs deep learning (DL), outperforms the others. In [23], two of the most common
ML approaches were used to classify the Wisconsin Breast Cancer (original) dataset, and their
classification performance was compared in terms of precision, accuracy, ROC area, and recall. The SVM
approach produced the best results, with the highest accuracy.

Naji et al. [24] applied five ML approaches to the Breast Cancer Wisconsin Diagnostic dataset:
RF, SVM, logistic regression, DT (C4.5), and KNN. The authors then compared and evaluated the
performance of the classifiers. The major goal of that study was to use ML approaches for predicting
and diagnosing breast cancer, and to determine which ones were the most efficient in terms of accuracy,
confusion matrix, and precision. SVM was found to outperform all other classifiers, reaching the best
accuracy of 97.2%. Six supervised ML methods are presented in [25], including k-NN, DT, logistic
regression, RF, and SVM with a radial basis function (RBF) kernel. DL with Adam gradient descent
learning was also used, since Adam combines the advantages of the adaptive gradient approach with
those of root mean square propagation. The hyper-parameters of each model were tuned individually,
which improved accuracy both within each model and in comparison with the other models. DL produced
the most precise results with the least loss, reaching an accuracy of 98.24%.

Arya and Saha [26] suggested gated attentive DL models stacked with RF classifiers to improve
breast cancer prognosis prediction using informative features and multi-modal data. The model has two
phases: phase one generates stacked features using a sigmoid gated attention convolutional neural
network (CNN), and phase two passes the stacked features to the RF classifier. A comparison of the
proposed approach with other current approaches reveals significant improvements in estimating the
survival of breast cancer patients, with a 5.1% increase in sensitivity.

Yang et al. [27] acquired 287 stages I breast cancer cases and divided them into two groups: test
(N=90) and training (N=197). Four accessible microarray datasets yielded 14 candidate genes. After
choosing a superior algorithm, a prediction model was created from these 14 candidate genes and 3
reference genes, whose expression was measured by TaqMan probe-based quantitative polymerase chain
reaction. Compared with the SVM, RF, and k-NN algorithms, the NB algorithm exhibited a greater
predictive value (P less than 0.05). This 17-gene model had a strong positive association with
pathologic complete response (pCR) (odds ratio, 8.914; 95% confidence interval, 4.43-17.934; P 0.001).
With this approach, the enrolled patients were divided into insensitive (INS) and sensitive (SE)
groups. The INS and SE groups had significantly different pCR rates (42.3% vs. 7.6%, P 0.001). The
specificity and sensitivity of this prediction model were 62.0% and 84.5%, respectively. A gene
expression panel with tens of important genes, applied in an ML model, thus offers predictive
potential for the chemo-sensitivity of breast cancer without relying on entire transcriptome-based
approaches.

Our paper examines the behaviour of twelve candidate classifiers (logistic regression, SVM, RF,
Gaussian NB, KNN, MLP regression, MLP, DT, perceptron, linear regression, XGBoost, and gradient
boosting) for the prediction of BCD. The aim is to determine which classifiers predict breast cancer
most accurately on future samples.


2. PROPOSED METHOD 
2.1. Description and distribution of the dataset 

The Breast Cancer Wisconsin (diagnostic) dataset was downloaded from the Kaggle machine
learning and data science community website (https://www.kaggle.com/uciml/breast-cancer-wisconsin-data)
for this empirical investigation. The dataset contains 569 instances, each with 33 attributes; 212
samples are malignant and 357 are benign. Table 1 summarizes the dataset's features. For each cell
nucleus, ten real-valued characteristics are evaluated: i) texture (standard deviation of grey-scale
values), ii) radius (mean of the distances from the center to the points on the perimeter), iii) area,
iv) perimeter, v) smoothness (local variation in radius lengths), vi) compactness
(perimeter^2/area - 1), vii) concavity (severity of the concave portions of the contour), viii) concave
points (number of concave portions of the contour), ix) symmetry, and x) fractal dimension
("coastline approximation" - 1).

For each image, the mean, the standard error (SE), and the "worst" or largest value (mean of the
three largest values) of these features were calculated, yielding 30 features. For instance, field 3
is the mean radius, field 13 is the radius SE, and field 23 is the worst radius. All feature values
are recorded with four significant digits. There are no missing attribute values.
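
The field numbering in the example above can be reproduced with a short sketch. It assumes that fields 1 and 2 hold the ID and the diagnosis, and that the ten base measurements appear in the order listed in `BASE_FEATURES`; this ordering and all names in the code are illustrative assumptions, chosen only so that fields 3, 13, and 23 match the text.

```python
# Assumed order of the ten base measurements within each statistic block.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# The three statistics computed per measurement: mean, standard error, worst.
STATISTICS = ["mean", "se", "worst"]


def build_field_map():
    """Return {field_number: column_name} using 1-based field numbers."""
    fields = {1: "id", 2: "diagnosis"}
    number = 3
    for statistic in STATISTICS:
        for base in BASE_FEATURES:
            fields[number] = f"{base}_{statistic}"
            number += 1
    return fields


FIELDS = build_field_map()
FEATURE_FIELDS = [name for number, name in FIELDS.items() if number >= 3]
```

Under these assumptions, `FIELDS[3]`, `FIELDS[13]`, and `FIELDS[23]` are the mean, SE, and worst radius, and `FEATURE_FIELDS` holds the 30 derived features.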


Table 1. Dataset features information 


- ID                 - radius_worst          - fractal_dimension_mean
- radius_mean        - perimeter_worst       - texture_se
- perimeter_mean     - smoothness_worst      - area_se
- smoothness_mean    - concavity_worst       - compactness_se
- concavity_mean     - symmetry_worst        - concave points_se
- symmetry_mean      - Unnamed               - fractal_dimension_se
- radius_se          - Diagnosis             - texture_worst
- perimeter_se       - texture_mean          - area_worst
- smoothness_se      - area_mean             - compactness_worst
- concavity_se       - compactness_mean      - concave points_worst
- symmetry_se        - concave points_mean   - fractal_dimension_worst


Figure 1 shows the correlations among the variables in the dataset. Figure 2 plots the diagnosis
result, which is either malignant (1) or benign (0). The remaining figures display the primary
features that are significant in evaluating whether a tumor is malignant or benign: texture mean
(Figure 3), perimeter mean (Figure 4), smoothness mean (Figure 5), compactness mean (Figure 6), and
symmetry mean (Figure 7).

2.2. Description of classification methods

A brief summary of each classification model is provided below to give basic information about
the 12 classifiers. A decision tree classifier is a flowchart-like structure in which every internal
node represents a test on an attribute value, every branch represents an outcome of that test, and the
leaf nodes represent classes. It is worth noting that no root-to-leaf path should test the same
discrete attribute more than once.

The k-NN model is a lazy learning approach: it has no explicit training phase and produces
predictions from the k nearest neighbors of the instance in question. All computation is deferred
until a prediction for an instance is requested. The Euclidean distance [28] is frequently used to
determine closeness.
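
A minimal sketch of this lazy prediction scheme is given below, assuming majority voting among the k nearest training instances under the Euclidean distance. The function names and the toy data are hypothetical and are not taken from the experiments in this paper.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_rows, train_labels, query, k=3):
    """Label a query by majority vote among its k nearest training rows."""
    ranked = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature samples: 0 = benign, 1 = malignant.
TRAIN_ROWS = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
TRAIN_LABELS = [0, 0, 0, 1, 1, 1]
```

No model is fitted in advance; every call to `knn_predict` scans the stored training rows, which is what makes the method "lazy".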

SVMs are particularly effective when the dataset has a large number of features. A hyperplane
that separates the data in a higher-dimensional space is constructed from the essential training
tuples, which are called support vectors. In brief, SVMs are based on the concept of a "margin." The
hyperplane separates the two classes, which lie on opposite sides of it, and the goal is to maximize
the margin, that is, the gap between the separating hyperplane and the nearest instances on either
side. The SVM decision function may be defined as in (1).


$$d(X^T) = \sum_{i=1}^{l} y_i \alpha_i X_i X^T + b_0 \tag{1}$$
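
As an illustration, a trained linear SVM can be evaluated as a weighted sum of dot products between the support vectors and the test tuple, plus a bias. The sketch below assumes that reading of (1), with class labels +1 and -1, one multiplier per support vector, and a bias `b0`; all numeric values are hypothetical and were not obtained by training.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def svm_decision(support_vectors, labels, alphas, b0, x):
    """Sum of label * multiplier * <support vector, x>, plus the bias b0."""
    total = sum(
        y_i * alpha_i * dot(sv, x)
        for sv, y_i, alpha_i in zip(support_vectors, labels, alphas)
    )
    return total + b0


def svm_classify(support_vectors, labels, alphas, b0, x):
    """The sign of the decision value gives the side of the hyperplane."""
    return 1 if svm_decision(support_vectors, labels, alphas, b0, x) >= 0 else -1


# Hypothetical parameters for illustration only.
SUPPORT_VECTORS = [(1.0, 1.0), (-1.0, -1.0)]
SV_LABELS = [1, -1]
ALPHAS = [0.5, 0.5]
B0 = 0.0
```

Only the support vectors enter the sum, so the cost of a prediction depends on their number and not on the size of the full training set.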


Rather than hard class predictions, the logistic regression model produces probability estimates,
which makes it appropriate for binary classification problems. In the model, the log-odds of an event
occurring are treated as a linear function of a set of input features. To determine the actual class
label, the logistic regression model calculates the probability p from a linear combination of the
independent variables. The estimated regression model is given in (2).


$$p = \frac{e^{\beta_0 + \beta_1 x_1}}{1 + e^{\beta_0 + \beta_1 x_1}} \tag{2}$$
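
The estimated model can be evaluated directly, as in the following sketch for a single predictor. The coefficient values and the 0.5 decision threshold are assumptions made for illustration; they are not fitted values from this study.

```python
import math


def logistic_probability(x1, beta0, beta1):
    """Probability of the positive class for one predictor x1."""
    z = beta0 + beta1 * x1
    return math.exp(z) / (1.0 + math.exp(z))


def logistic_label(x1, beta0, beta1, threshold=0.5):
    """Turn the probability into a class label (1 = positive, 0 = negative)."""
    return 1 if logistic_probability(x1, beta0, beta1) >= threshold else 0


# Hypothetical coefficients, not estimated from the dataset.
BETA0, BETA1 = -1.0, 0.5
```

With these coefficients, the linear term is zero at `x1 = 2.0`, so the probability there is exactly 0.5 and larger inputs are assigned to the positive class.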


The random forest method works by constructing many DTs and combining them to obtain stable
and strong predictions. A random forest can handle both regression and classification problems
efficiently.

The perceptron is a parameterized function that takes a real-valued vector as input and produces
a Boolean output, as presented in [29]. The output is obtained by thresholding a linear function of
the input; the perceptron's parameters are the coefficients of that linear function.

Gaussian distributions [30] are used to express the likelihoods of the features conditioned on the
classes, a common technique for handling continuous attributes in NB classification. As a result, a
Gaussian probability density function (PDF) is used to define each attribute as in (3):


$$X_i \sim N(\mu, \sigma^2) \tag{3}$$


The Gaussian PDF is shaped like a bell and is specified via the equation: 


$$N(\mu, \sigma^2)(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{4}$$


where $\mu$ is the mean and $\sigma^2$ is the variance. In NB, the number of parameters required is
of the order O(nk), where n is the number of attributes and k is the number of classes. In particular,
a normal distribution $P(X_i|C) \sim N(\mu, \sigma^2)$ must be specified for each continuous
attribute. The parameters of these normal distributions are obtained by (5) and (6):


$$\mu_{X_i|C=c} = \frac{1}{N_c} \sum_{i=1}^{N_c} x_i \tag{5}$$

$$\sigma^2_{X_i|C=c} = \frac{1}{N_c} \sum_{i=1}^{N_c} x_i^2 - \mu^2 \tag{6}$$


where $N_c$ is the number of examples for which C=c and N is the total number of examples used for
training. Estimating P(C=c) for all classes is straightforward using the relative frequencies:


$$P(C=c) = \frac{N_c}{N} \tag{7}$$
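
The estimation and prediction steps of Gaussian NB can be sketched in a few lines. The code below assumes the per-class mean, the per-class variance (computed as the mean of squares minus the squared mean), and the relative-frequency prior described above. The variance floor, the use of log-probabilities, the function names, and the toy data are assumptions added for numerical safety and illustration.

```python
import math
from collections import defaultdict

VARIANCE_FLOOR = 1e-9  # assumption: avoids a zero variance on tiny samples


def fit_gaussian_nb(rows, labels):
    """Estimate the prior, per-attribute means, and variances of each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n_c = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n_c for col in columns]
        variances = [
            max(sum(v * v for v in col) / n_c - m * m, VARIANCE_FLOOR)
            for col, m in zip(columns, means)
        ]
        model[label] = (n_c / len(rows), means, variances)
    return model


def log_gaussian(x, mean, variance):
    """Logarithm of the Gaussian probability density."""
    return -0.5 * math.log(2.0 * math.pi * variance) - (x - mean) ** 2 / (2.0 * variance)


def predict(model, row):
    """Return the class with the largest log prior plus log likelihood."""
    def score(label):
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(row, means, variances)
        )
    return max(model, key=score)


# Hypothetical two-feature samples: "B" = benign, "M" = malignant.
ROWS = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (3.0, 4.0), (3.2, 3.8), (2.8, 4.2)]
LABELS = ["B", "B", "B", "M", "M", "M"]
MODEL = fit_gaussian_nb(ROWS, LABELS)
```

Each class stores one prior plus a mean and a variance per attribute, which matches the O(nk) parameter count stated above.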

The most frequent and widely used feedforward NN is the MLP network. The key processing
elements of MLP networks are neurons, which are coupled in a one-directional manner through
connections known as "weights" [31]. Equation (8) is used to calculate the input to each hidden node
of the MLP network:


$$S_j = \sum_{i=1}^{n} (W_{ij} \cdot X_i) - \theta_j, \quad j = 1, 2, \ldots, h \tag{8}$$


where $W_{ij}$ is the weight connecting the ith node (in the input layer) to the jth node (in the
hidden layer), $\theta_j$ is the bias of the jth node (in the hidden layer), and $X_i$ is the input to
the ith node (in the input layer). The output of each hidden node is computed with the sigmoid
function. The multilayer perceptron regressor (MLPR) technique is a regression-specific application
of ANNs. It uses the Waikato environment for knowledge analysis (WEKA) optimization class to train a
multi-layer perceptron with one hidden layer by minimizing the chosen loss function. MLPR uses
logistic functions as activation functions for all units except the output unit, and it standardizes
the target attribute to rescale it. All network parameters are initialized with small, normally
distributed random values.
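
The hidden-layer computation described above can be sketched as a weighted sum of the inputs minus a bias, followed by the sigmoid function. The layout of the weight matrix (`weights[i][j]` connects input node i to hidden node j) and the numeric values are assumptions made for illustration.

```python
import math


def sigmoid(s):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-s))


def hidden_layer_outputs(inputs, weights, biases):
    """Compute sigmoid(sum_i weights[i][j] * inputs[i] - biases[j]) for each j."""
    outputs = []
    for j, theta_j in enumerate(biases):
        s_j = sum(weights[i][j] * x_i for i, x_i in enumerate(inputs)) - theta_j
        outputs.append(sigmoid(s_j))
    return outputs


# Hypothetical network with two inputs and two hidden nodes.
INPUTS = [1.0, 2.0]
WEIGHTS = [[0.5, -1.0], [0.25, 1.0]]  # WEIGHTS[i][j]: input i -> hidden j
BIASES = [1.0, 1.0]
HIDDEN = hidden_layer_outputs(INPUTS, WEIGHTS, BIASES)
```

Here both weighted sums cancel the bias exactly, so each hidden node outputs sigmoid(0) = 0.5.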

Linear regression is the most common prediction model for determining associations between
variables. The model is linear regardless of whether the data are univariate or multivariate. Simple
linear regression and multiple linear regression are the two types of linear regression. Equation (9)
describes the linear regression model.


$$Y = X\beta + \varepsilon \tag{9}$$
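
For the simple (one-predictor) case, the coefficients can be estimated in closed form by ordinary least squares. The sketch below assumes a model with an intercept and a single slope; the function name and the data are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) minimizing the squared error of y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical noise-free data generated from y = 1 + 2x.
XS = [0.0, 1.0, 2.0, 3.0]
YS = [1.0, 3.0, 5.0, 7.0]
INTERCEPT, SLOPE = fit_simple_linear_regression(XS, YS)
```

When a regression output is used for a two-class problem, a threshold on the predicted value is one way to obtain a label; the threshold is a modeling choice and is not fixed by the regression itself.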


Chen and Guestrin [32] proposed XGBoost in 2016. It is regarded as an advanced estimator with
very high performance in both regression and classification, and it offers significant advantages
over typical gradient boosting algorithms. In contrast to gradient boosted decision trees (GBDT), the
loss function in XGBoost includes a regularization term to prevent overfitting:


$$L_K(F(x_i)) = \sum_{i=1}^{n} \Psi(y_i, F_K(x_i)) + \sum_{k=1}^{K} \Omega(f_k) \tag{10}$$


where $F_K(x_i)$ is the prediction for the ith sample at the Kth boosting round, $\Psi(\cdot)$ is a
loss function that measures the difference between the actual and the predicted labels, and
$\Omega(f_k)$ is the regularization term, which can be written as:


$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2 \tag{11}$$


In the regularization term, $\gamma$ is the complexity parameter of the leaves, T is the number of
leaves, $\lambda$ is the penalty parameter, and $\|w\|^2$ is computed from the outputs of the leaf
nodes. In addition, unlike GBDT, XGBoost adopts a second-order Taylor series expansion of the
objective function:


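$$L_K = \sum_{j=1}^{T} \left[ \Big( \sum_{i \in I_j} g_i \Big) w_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) w_j^2 \right] + \gamma T \tag{12}$$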


where $g_i$ and $h_i$ are the first- and second-order gradient statistics of the loss function,
respectively. Letting $I_j$ denote the sample set of leaf j, the objective can be written as:


$$L_K = \sum_{j=1}^{T} \left[ \Big( \sum_{i \in I_j} g_i \Big) w_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) w_j^2 \right] + \gamma T \tag{13}$$
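
Each bracketed term of the leaf-wise objective is a quadratic in the leaf weight, so the weight that minimizes it can be found in closed form by setting the derivative to zero, which gives the negative sum of the first-order statistics divided by the sum of the second-order statistics plus the penalty. The sketch below assumes that reading; the function names and the gradient values are hypothetical.

```python
def leaf_objective(gradients, hessians, lam, weight):
    """Quadratic contribution of one leaf: G * w + 0.5 * (H + lam) * w^2."""
    g_sum = sum(gradients)
    h_sum = sum(hessians)
    return g_sum * weight + 0.5 * (h_sum + lam) * weight ** 2


def optimal_leaf_weight(gradients, hessians, lam):
    """Minimizer of the leaf objective: w* = -G / (H + lam)."""
    return -sum(gradients) / (sum(hessians) + lam)


# Hypothetical first- and second-order statistics of the samples in one leaf.
GRADIENTS = [1.0, 2.0, -0.5]
HESSIANS = [0.5, 0.5, 1.0]
LAMBDA = 0.5
BEST_WEIGHT = optimal_leaf_weight(GRADIENTS, HESSIANS, LAMBDA)
BEST_VALUE = leaf_objective(GRADIENTS, HESSIANS, LAMBDA, BEST_WEIGHT)
```

A larger penalty shrinks the optimal weight toward zero, which is how the regularization term limits the influence of any single leaf.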


Finally, optimizing the objective function reduces to finding the minimum of a quadratic
function. Like GBDT, XGBoost uses the learning rate, the maximum tree depth, the number of boosting
rounds, and subsampling to tackle the over-fitting problem. Gradient boosting is an ensemble approach
that combines several predictors sequentially, applying a certain amount of shrinkage. In each
iteration, the base model is fitted to a randomly selected subset of the training set. Randomly
subsampling the training data can increase both the accuracy and the execution speed of gradient
boosting. As an ensemble technique, gradient boosting can be characterized as:


$$Y = \mu + \sum_{m=1}^{M} \nu\, h(y; X) + e \tag{14}$$

To estimate a model's performance reliably, specific performance measures are needed to assess
the quality of each classifier under evaluation. In this paper, four performance measures were
applied to achieve a robust evaluation of the classifiers: accuracy, recall, precision, and F1-score.
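
The four measures can be computed from the counts of true and false positives and negatives, as in the following sketch. It uses the standard definitions, treats label 1 (malignant) as the positive class by assumption, and returns 0.0 when a denominator is zero; the example labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1-score for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = malignant, 0 = benign.
Y_TRUE = [1, 1, 1, 0, 0, 0, 0, 1]
Y_PRED = [1, 1, 0, 0, 0, 1, 0, 1]
METRICS = classification_metrics(Y_TRUE, Y_PRED)
```

In this example there are three true positives, three true negatives, one false positive, and one false negative, so all four measures equal 0.75.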


3. RESULTS AND DISCUSSION 

The experiment used the Breast Cancer Wisconsin (diagnostic) dataset to diagnose BCD with
several classifiers and the following sampling schemes: tenfold cross-validation, stratified shuffle
split, and random sampling with a 75% training set size. As can be seen in Figure 8, RF, k-NN,
gradient boosting, Gaussian NB, MLP, logistic regression, XGBoost, MLP regression, and linear
regression all reached a classification accuracy of 89%. SVM and perceptron had a higher
classification accuracy of 90%, while DT had the lowest accuracy, 87%. The F1-score combines the
precision and recall ratios, and a higher value indicates a better classification capability.
Gradient boosting, MLP, MLP regression, logistic regression, XGBoost, and linear regression share the
same F1-score of 94%, while RF, perceptron, k-NN, and Gaussian NB have an F1-score of 92%, and DT and
SVM have an F1-score of 93%.
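
A stratified split with a 75% training share can be sketched as follows. The code shuffles the indices of each class separately and cuts each class at the same fraction, so the class proportions are preserved in both parts. The seed, the rounding rule, and the function name are assumptions; this is an illustration of the idea and not the exact procedure used in the experiments.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Return (train_indices, test_indices) preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(round(train_fraction * len(indices)))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Class sizes taken from the dataset description: 212 malignant, 357 benign.
ALL_LABELS = ["M"] * 212 + ["B"] * 357
TRAIN_IDX, TEST_IDX = stratified_split(ALL_LABELS)
```

Because the cut is applied per class, the malignant share of the training part stays close to the malignant share of the full dataset.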

Figure 8. Comparison of performance evaluation


It must be noted that, in Figure 8, the perceptron and SVM approaches provided the highest
sensitivity and accuracy, linear regression generated the best results in terms of F1-score and
precision, and the DT had the lowest sensitivity and accuracy. In addition, the KNN had a lower
precision and F1-score. In terms of support, k-NN scores the highest, while gradient boosting and DT
score the lowest. According to this comparison, the SVM and perceptron classifiers, once trained on
the dataset, achieved the highest performance among all classifiers for the prediction of new cases.

The performance of the proposed classifiers was evaluated using common validation metrics:
accuracy, recall, precision, and F-measure. These measures are based on the separation of relevant
and irrelevant items. The validation results in Figure 8 show that the classification was labeled
correctly. The performance of the proposed classifiers using the SVM, KNN, NB, and MLP algorithms was
compared with the same algorithms combined with an ensemble of filters (EoF), as well as with the
adaptive mutation enhanced elephant herding optimization + kernel extreme learning machine
(AMEHO+KELM) classifier, on the Wisconsin diagnostic Breast Cancer (WDBC) and Wisconsin original
Breast Cancer (WOBC) datasets [33]. As is evident in Figure 9, the MLP classifier achieved
outstanding performance in the resulting clusters compared with the AMEHO+KELM classifier and the
other classifiers.

Figure 9. Comparison of performance evaluation with WDBC and WOBC datasets


4. CONCLUSION 

Cancer is a disease that kills a large number of people, and breast cancer is one of its many
types. Early identification of cancer not only saves lives but also minimizes treatment costs, so a
reliable prediction system is highly valuable. In this study, 12 supervised ML algorithms were
evaluated (with regard to precision, accuracy, recall/sensitivity/TP rate, and F1-score) to find the
best model for BCD prediction. On the Breast Cancer Wisconsin (diagnostic) dataset, the perceptron
and SVM approaches provided the highest sensitivity and accuracy, linear regression generated the
best results for precision and F1-score, and the DT had the lowest sensitivity and accuracy. In
addition, the k-NN had a lower precision and F1-score. According to this comparison, the SVM and
perceptron classifiers achieved the highest performance among all classifiers for the prediction of
new cases, and they are the most accurate predictors among the ML approaches considered, with an
accuracy of 90%. Based on recall, accuracy, precision, and F1-measure, both perceptron and SVM
demonstrated their efficiency. Future work will focus on developing a better prediction model using
ensemble approaches and on fine-tuning ensemble techniques to improve model performance.